Abstract

The widespread use of smart devices gives rise to both 
security and privacy concerns. Fingerprinting smart de- 
vices can assist in authenticating physical devices, but 
it can also jeopardize privacy by allowing remote iden- 
tification without user awareness. We propose a novel 
fingerprinting approach that uses the microphones and 
speakers of smart phones to uniquely identify an indi- 
vidual device. During fabrication, subtle imperfections 
arise in device microphones and speakers which induce 
anomalies in produced and received sounds. We ex- 
ploit this observation to fingerprint smart devices through 
playback and recording of audio samples. We use audio- 
metric tools to analyze and explore different acoustic fea- 
tures and analyze their ability to successfully fingerprint 
smart devices. Our experiments show that it is even pos- 
sible to fingerprint devices that have the same vendor and 
model; we were able to accurately distinguish over 93% 
of all recorded audio clips from 15 different units of the 
same model. Our study identifies the prominent acoustic 
features capable of fingerprinting devices with high suc- 
cess rate and examines the effect of background noise 
and other variables on fingerprinting accuracy. 

1 Introduction 

Mobile devices, including smartphones, PDAs, and tablets, are
quickly becoming widespread in modern society. In 2012 a total
of 1.94 billion mobile devices were shipped, of which 75% were
smart and highly-featured phones [6][9][15]. Canalys predicted
that the mobile device market will reach 2.6 billion units by
2016, with smartphones and tablets continuing to dominate
shipments [15]. The rapid uptake of intelligent mobile devices
is not surprising, given the numerous advantages they provide
consumers, from entertainment and social applications to
business and advanced computing capabilities. However,
smartphones, with all their interactive, location-centric, and
connectivity-based features, raise serious concerns about user
privacy and information security. A large body of research has
highlighted privacy and security issues of smartphones and
proposed solutions [10, 36, 37, 42, 60, 73, 82]. All these
works center around securing the software of mobile devices,
including the operating system and network stack, for example
by instilling fine-grained access control policies or by
restricting flows of private data to a network sink.

In this paper we propose a novel technique for fingerprinting
the hardware of smartphones. The observation is that even if
the software on mobile devices is strengthened, hardware-level
idiosyncrasies in microphones and speakers can be used to
fingerprint physical devices. During manufacturing,
imperfections are introduced in the analog circuitry of these
components, and as such, no two microphones or speakers are
ever alike. Through an observational study, we find that these
imperfections are substantial enough, and prevalent enough,
that we can reliably distinguish between devices by passively
observing audio and conducting a simple spectral analysis on
the recorded audio. Our approach can make it substantially
easier for an adversary to track and identify people in public
locations, identify callers, and pose other threats to the
security and privacy of mobile device users. Our approach works
well even with few samples. For example, we show that with our
techniques an adversary could even use the short ringtones
produced by mobile device speakers to reliably track users in
public environments.

Our approach centers around the development of a novel
fingerprinting mechanism, which aims to "pull out"
imperfections in device circuitry. Our mechanism has two parts:
a method to extract auditory fingerprints and a method to
efficiently search for matching fingerprints in a database. To
generate fingerprints of speakers we record audio clips played
from smartphones on an external device (i.e., laptop/PC), and
vice versa for generating fingerprints of microphones. We use
two different classifiers to evaluate our fingerprinting
approach. Moreover, we test our fingerprinting approach on
different genres of audio clips at various frequencies. We also
study in detail various audio features that can be used to
accurately fingerprint smartphones. Our study reveals that the
mel-frequency cepstral coefficients (MFCCs) are the dominant
feature for fingerprinting smartphones. We also analyze the
sensitivity of our fingerprinting approach to different factors
such as sampling frequency, distance between speaker and
recorder, training set size, and ambient background noise.

Contributions. We offer the following contributions: 

• We propose a novel approach to fingerprinting 
smart devices. Our approach leverages the manu- 
facturing idiosyncrasies of microphones and speak- 
ers embedded in smart devices. 

• We study feasibility of a spectrum of existing audio 
features that can be used to accurately fingerprint 
smartphones. We find that the mel-frequency cep- 
stral coefficient (MFCC) performs particularly well 
for fingerprinting smartphones. 

• We investigate two different classifiers to evalu- 
ate our fingerprinting approach. We conclude that 
Gaussian Mixture Models (GMM) are more effec- 
tive in classifying our recorded audio fingerprints. 

• We perform experiments across several different 
genres of audio excerpts. We also analyze how dif- 
ferent factors like sampling frequency, distance be- 
tween speaker and recorder, training set size and 
ambient background noise impact the accuracy of 
our fingerprinting. 

• Finally, we discuss how our fingerprinting approach 
can be used as an additional factor for authentica- 
tion. 

Roadmap. The remainder of this paper is organized as follows.
Section 2 gives an overview of our fingerprinting approach. We
discuss why microphones and speakers integrated in smartphones
can be used to generate unique fingerprints in Section 3.
Section 4 describes the different audio features considered in
our experiments, along with the classification algorithms used
in our evaluation. Section 5 presents our experimental results
in detail. We discuss two diametric applications of our device
fingerprinting in Section 6. We describe some related works in
Section 7. Section 8 discusses some limitations of our
approach. Finally, we conclude in Section 9.

2 Overview 

In this section we give an overview of our approach, and 
identify the key challenges that we address in this paper. 



The key insight behind our work is that imperfections 
in smart device hardware induce unique signatures on re- 
ceived/transmitted audio, and these unique signatures, if 
identified, can be used to fingerprint the device. Our ap- 
proach consists of three key components. The first chal- 
lenge we encounter is acquiring a set of audio samples 
for analysis in the first place. To do this, we have a 
listener module, responsible for receiving and recording 
device audio. The listener module could be deployed as
an application on the smart device (many mobile OSes
allow direct access to microphone inputs), or as a
stand-alone device (e.g., the adversary has a microphone in a
public setting to pick up device ringtones). The next challenge
is to effectively identify device signatures from the re- 
ceived audio stream. To do this, we have an analyzer 
module, which leverages signal processing techniques to 
localize spectral anomalies, and construct a 'fingerprint' 
of the auditory characteristics of the device. 

A key question that remains, which forms a major fo- 
cus of this paper, is in construction of an effective finger- 
printing scheme. Our goal is to determine a scheme that 
maximizes the ability to distinguish different devices. To 
do this, it helps to have some understanding of how de- 
vices differ at a physical level. Devices can vary at differ- 
ent layers of the manufacturing process. The most obvi- 
ous way to distinguish devices manufactured by different 
vendors is to analyze the protocol stack installed in the 
devices. Usually different vendors have their own dis- 
tinct features integrated inside the protocol stack. A close 
analysis of the protocol stack can help in distinguishing 
devices from different vendors. However, this approach 
is not helpful in distinguishing devices produced by the 
same vendor. To distinguish devices produced by the same
vendor we need to look more deeply into the devices themselves,
because at the hardware level no two devices are the same.
Hardware imperfections are likely to arise during the
manufacturing process of sensors, radio transmitters, and
crystal oscillators, suggesting the existence of unique
fingerprints. These idiosyncrasies can be exploited to
distinguish devices. Figure 1 illustrates the different
device-specific features that could be utilized to identify
devices uniquely. We investigate properties of device hardware
in more detail in Section 3.

A second aspect to this question is what sort of audio analysis
techniques are most effective in identifying unique signatures
of device hardware. There are a large number of audio
properties which could be used (spectral entropy, zero
crossings, pitch, etc.) as well as a broad spectrum of analysis
algorithms that can be used to summarize these properties
(principal component analysis, linear discriminant analysis,
feature selection, etc.). We study alternative properties to
characterize hardware-induced auditory anomalies in
Section 4.1, as well as algorithms for effectively clustering
them.
Figure 1: Device-specific features that can be exploited to
uniquely distinguish devices.



3 Source of Fingerprints

In this section we take a closer look at the microphones and
speakers embedded in today's smartphones. This will help us
understand how microphones and speakers can act as a potential
source of unique fingerprints.

3.1 Closer Look at Microphones

Microphones in modern smartphones are based on Micro Electro
Mechanical Systems (MEMS) [11, 13, 18]. To enhance active noise
and echo canceling capabilities, most smartphones today have
more than one MEMS microphone. For example, the iPhone 5 has a
total of three embedded MEMS microphones [11]. According to the
IHS-iSuppli report, Apple and Samsung were the top consumers of
MEMS microphones in 2012, accounting for a combined 54% of all
shipped MEMS microphones [18].

A MEMS microphone, sometimes called a microphone chip or
silicon microphone, consists of a coil-less pressure-sensitive
diaphragm directly etched into a silicon chip. It comprises a
MEMS die and a complementary metal-oxide-semiconductor (CMOS)
die combined in an acoustic housing [8, 12]. The CMOS die often
includes both a preamplifier and an analog-to-digital (AD)
converter. Modern fabrication techniques enable highly compact
designs, making them well suited for integration in digital
mobile devices. The internal architecture of a MEMS microphone
is shown in Figure 2. From the figure we can see that the MEMS
microphone's physical design is based on a variable capacitor
consisting of a highly flexible diaphragm in close proximity to
a perforated, rigid back-plate. The perforations permit the air
between the diaphragm and back-plate to escape. When an
acoustic signal reaches the diaphragm through the acoustic
holes, the diaphragm is set in motion. This mechanical
deformation causes a change in capacitance, which in turn
causes a change in voltage. In this way sound pressure is
converted into an electrical signal for further processing. The
back-chamber acts as an acoustic resonator and the ventilation
hole allows the air compressed inside the back chamber to flow
out, allowing the diaphragm to move back into its original
place.

The sensitivity of the microphone depends on how well the
diaphragm deflects under acoustic pressure; it also depends on
the gap between the static back-plate and the flexible
diaphragm. Unfortunately, even though the manufacturing process
of these microphones has been streamlined, no two chips roll
off the assembly line functioning in exactly the same way.¹
While subtle imperfections in the microphone chips may go
unnoticed by human ears, computationally such discrepancies may
be sufficient to discriminate them, as we later show.

3.2 Closer Look at Microspeakers

Micro-speakers are a scaled down version of a basic acoustic
speaker. So let us first look at how speakers work before we
discuss how microspeakers can be used to generate unique
fingerprints. Figure 3(a) shows the basic components of a
speaker. The diaphragm is usually made of paper, plastic, or
metal and its edges are connected to the suspension. The
suspension is a rim of flexible material that allows the
diaphragm to move. The narrow end of the diaphragm's cone is
connected to the voice coil. The voice coil is attached to the
basket by a spider (damper), which holds the coil in position
but allows it to move freely back and forth. A permanent magnet
is positioned directly below the voice coil.

Sound waves are produced whenever electrical cur- 
rent flows through the voice coil, which acts as an elec- 
tromagnet. Running varying electrical current through 
the voice coil induces a varying magnetic field around 
the coil, altering the magnetization of the metal it is 
wrapped around. When the electromagnet's polar ori- 
entation switches, so does the direction of repulsion and 
attraction. In this way, the magnetic force between the 
voice coil and the permanent magnet causes the voice 
coil to vibrate, which in turn vibrates the speaker di- 
aphragm to generate sound waves. 

Figure 3(b) shows a typical MEMS microspeaker chip and
Figure 3(c) shows the components inside the microspeaker
[26, 81]. The components are similar to those of a basic
speaker; the only differences are the size and fabrication
process [28, 49, 74]. The amplitude and frequency of the sound
wave produced by the speaker's diaphragm are dictated,
respectively, by the distance and rate at which the voice coil
moves. However, due to the inevitable variations and
imperfections of the manufacturing process, no two speakers are
going to be alike. Thus, subtle differences in the sound
generated by different speakers can arise. In our work, we
develop techniques to computationally localize and evaluate
these differences.

¹ Imperfections can arise for the following reasons: slight
variations in the chemical composition of components from one
batch to the next, wear in the manufacturing machines, or
changes in temperature and humidity.

Figure 2: The internal architecture of the MEMS microphone chip
used in smartphones.



4 Audio Features and Classification Algorithms

In this section we briefly describe the acoustic features 
that we used in generating fingerprints. We also discuss 
the classification algorithms used in identifying the de- 
vices from which the fingerprints originated. 

4.1 Audio Features 

Given our knowledge that imperfections exist in device audio
hardware, we now need some way to detect them. To do this, our
approach identifies acoustic features from an audio stream, and
uses the features to construct a fingerprint of the device.
Computing acoustic features from an audio stream is a subject
of much research [20, 25, 61, 76]. To gain an understanding of
how a broad range of acoustic features are affected by device
imperfections we investigate a total of 15 acoustic features
(listed in Table 1), all of which have been well-documented by
researchers. A detailed description of each acoustic feature is
available in Appendix A.
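To make the notion of an acoustic feature concrete, the following is a minimal Python sketch of three of the simpler features: signal RMS, low-energy rate, and spectral centroid. It is an illustration only and not the MIRtoolbox implementation used in the paper. The function names, the choice to average frame RMS values for the low-energy rate, and the use of a magnitude-weighted centroid over a naive DFT are all assumptions made for this example.

```python
import cmath
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def low_energy_rate(samples, frame_len):
    """Fraction of non-overlapping frames whose RMS is below the mean frame RMS.

    Assumes len(samples) >= frame_len; trailing samples that do not fill a
    whole frame are ignored.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    frame_rms = [rms(frame) for frame in frames]
    mean_rms = sum(frame_rms) / len(frame_rms)
    return sum(1 for r in frame_rms if r < mean_rms) / len(frame_rms)


def magnitude_spectrum(samples):
    """Magnitudes of DFT bins 0..n/2, computed naively in O(n^2) time."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of the spectrum.

    Assumes the frame is not silent (the spectrum must not be all zeros).
    """
    mags = magnitude_spectrum(samples)
    n = len(samples)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)
```

The naive DFT keeps the sketch dependency-free but is only practical for short frames; a real pipeline would use an FFT routine.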

4.2 Classification Algorithms 

Next, we need some way to leverage the set of fea- 
tures to perform device identification. To achieve this, 
we leverage a classification algorithm, which takes ob- 
servations (features) from the observed device as input, 
and attempts to classify the device into one of several 
previously-observed sets. 

To do this, our approach works as follows. First, we 
perform a training step, by collecting a number of ob- 
servations from a set of devices. Each observation (data 
point) corresponds to a set of features observed from that 
device, represented as a tuple with one dimension per 
feature. As such, data points can be thought of as ex- 
isting in a hyper-dimensional space, with each axis cor- 
responding to the observed value of a corresponding fea- 
ture. Our approach then applies a classification algorithm 
to build a representation of these data points, which can 
later be used to associate new observations with device 
types. When a new observation is collected, the clas- 
sification algorithm returns the most likely device that 
caused the observation. 

Table 1: Acoustic features

 #  | Feature              | Dimension | Description
----+----------------------+-----------+------------------------------------------------------------
 3  | Low-Energy-Rate      | 1         | The percentage of frames with RMS power less than the average RMS power for the whole audio signal
 4  | Spectral Centroid    | 1         | Represents the center of mass of a spectral power distribution
 5  | Spectral Entropy     | 1         | Captures the peaks of a spectrum and their locations
 6  | Spectral Irregularity| 1         | Measures the degree of variation of the successive peaks of a spectrum
 7  | Spectral Spread      | 1         | Defines the dispersion of the spectrum around its centroid
 8  | Spectral Skewness    | 1         | Represents the coefficient of skewness of a spectrum
 9  | Spectral Kurtosis    | 1         | Measure of the flatness or spikiness of a distribution relative to a normal distribution
 10 | Spectral Rolloff     | 1         | Defines the frequency below which 85% of the distribution magnitude is concentrated
 11 | Spectral Brightness  | 1         | Computes the amount of spectral energy corresponding to frequencies higher than a given cut-off threshold
 12 | Spectral Flatness    | 1         | Measures how energy is spread across the spectrum
 13 | MFCCs                | 13        | Compactly represents spectrum amplitudes
 14 | Chromagram           | 12        | Representation of the distribution of energy along the 12 distinct semitones or pitch classes
 15 | Tonal Centroid       | 6         | Maps a chromagram onto a six-dimensional hypertorus structure

To do this effectively, we need an efficient classification
algorithm. In our work, we compare the performance of two
alternative approaches described below: k-nearest neighbors
(associates an incoming data point with the device
corresponding to the nearest "learned" data points), and
Gaussian mixture models (computes a probability distribution
for each device, and determines the maximally-likely
association).

k-NN: The k-nearest neighbors algorithm (k-NN) is a
non-parametric lazy learning algorithm. The term
"non-parametric" means that the k-NN algorithm does not make
any assumptions about the underlying data distribution, which
is useful in analyzing real-world data with a complex
underlying distribution. The term "lazy learning" means that
the k-NN algorithm does not use the training data to make any
generalization; rather, all the training data are used in the
testing phase, making it computationally expensive (however,
optimizations are possible). The k-NN algorithm works by first
computing the distance from the input data point to all
training data points and then classifying the input data point
by taking a majority vote of the k closest training records in
the feature space [34]. The best choice of k depends upon the
data; generally, larger values of k reduce the effect of noise
on the classification, but make boundaries between classes less
distinct. We will discuss more about the choice of k in
Section 5.
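The k-NN procedure described above (compute distances to all training points, then take a majority vote among the k closest) can be sketched in a few lines of Python. This is an illustrative sketch rather than the classifier implementation used in the experiments. The function name, the Euclidean distance metric, and the two-dimensional feature vectors in the usage note are assumptions chosen for the example.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distances are
    Euclidean. Ties in the vote go to the label whose member is nearest,
    because Counter keeps insertion order and neighbors are sorted by
    distance.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

As a usage example with made-up feature values (say, RMS and spectral entropy), `knn_classify([((0.05, 0.80), "A"), ((0.06, 0.81), "A"), ((0.20, 0.90), "B")], (0.055, 0.805), k=3)` selects the label held by two of the three nearest points.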

GMM: A Gaussian mixture model is a probabilistic model that
assumes all the data points are generated from a mixture of a
finite number of Gaussian distributions with unknown
parameters. The unknown parameters and mixture weights are
estimated from training samples using an
expectation-maximization (EM) algorithm [30]. During the
matching phase the fingerprint for an unknown recording is
first compared with a database of pre-computed GMMs, and then
the class label of the GMM that gives the highest likelihood is
returned as the expected class for the unknown fingerprint.
GMMs are often used in biometric systems, most notably in human
speaker recognition systems, due to their capability of
representing a large class of sample distributions [70, 71].
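The matching step described above (score an unknown fingerprint against each device's model and return the label with the highest likelihood) can be illustrated with a deliberately simplified Python sketch. It fits a single diagonal-covariance Gaussian per device, which is the special case of a GMM with one component and therefore needs no EM iterations. The paper's experiments use full GMMs fitted with EM; the function names and the variance floor below are assumptions for illustration.

```python
import math


def fit_diag_gaussian(points, var_floor=1e-6):
    """Estimate per-dimension mean and variance from training vectors.

    `var_floor` (an assumed value) keeps variances strictly positive.
    """
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    variances = [
        max(sum((p[d] - means[d]) ** 2 for p in points) / n, var_floor)
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(model, x):
    """Log density of x under a diagonal-covariance Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def match(models, x):
    """Return the label of the model that gives x the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], x))
```

Here `models` is a dictionary mapping a device label to its fitted model, playing the role of the database of pre-computed models.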

5 Evaluation 

In this section we perform a series of experiments to evaluate
how well we can fingerprint smartphones by exploiting the
manufacturing idiosyncrasies of the microphones and speakers
embedded in them. We start by describing how we performed our
experiments (Section 5.1). Next, we briefly discuss the setup
for fingerprinting devices through speakers and microphones
(Sections 5.2 and 5.3). We then look at fingerprinting devices
made by different vendors (Section 5.4), and later on focus on
identifying devices manufactured by the same vendor
(Section 5.5). We also perform an analysis of which features
help most when identifying devices from the same vendor, by
determining the dominant (most-relevant) set of audio features
(Section 5.5.1). The performance of our approach is affected by
certain aspects of the operating environment, and we study
sensitivity to such factors later in this section.



5.1 Methodology 

To perform our experiments, we constructed a small testbed
environment with real smartphone hardware. In particular, our
default environment consisted of a 266 square foot (14' x 19')
office room, with nine-foot dropped ceilings with polystyrene
tile, comprising a graduate student office in a
University-owned building (used to house the computer science
department). The room was filled with desks and chairs, and
opens out on a public hall with foot traffic. The room also
receives a minimal amount of ambient noise from air
conditioning, desktop computers, and fluorescent lighting. We
placed smartphones in various locations in the room. To emulate
an attacker, we placed an ACER Aspire 5745 laptop in the room.
To investigate performance with inexpensive hardware, we used
the laptop's built-in microphone to collect audio samples (an
attacker willing to purchase a higher-quality microphone may
attain better performance). We investigate how varying this
setup affects performance of the attack later in Section 5.

Table 2: Types of phones used

Maker         | Model         | Quantity
--------------+---------------+---------
Apple         | iPhone 5      | 1
Google        | Nexus 4G      | 1
Samsung       | Galaxy Note 2 | 1
Motorola      | Droid A855    | 15
Sony Ericsson | W518          | 1

Table 3: Types of audio excerpts

Type         | Description
-------------+----------------------------------------------------
Instrumental | Musical instruments playing together, e.g., ringtone
Human speech | Small segments of human speech
Song         | Combination of human voice & instrumental sound

Devices and tools: We tested our device fingerprinting on
devices from five different manufacturers. Table 2 highlights
the models and quantities of the different phone sets used in
our experiments. As we emphasized earlier, we look at phones
produced by both different manufacturers and the same
manufacturer; hence the difference in quantities in Table 2.

We also investigate different genres of audio excerpts. 
Table [3] describes the different types of audio excerpts 
used in our experiments. Duration of the audio clips 
varies from 3 to 10 seconds. The default sampling fre- 
quency of all audio excerpts is 44.1kHz unless explic- 
itly stated otherwise. All audio clips are stored in WAV 
format using 16-bit pulse-code-modulation (PCM) tech- 
nique. 
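As an illustration of the clip format just described (WAV, 16-bit PCM, 44.1 kHz), the sketch below writes and reads such a clip with Python's standard `wave` and `struct` modules. It is not part of the paper's toolchain, which used Audacity and Hertz for recording. The helper names, the mono assumption, and the synthetic sine tone standing in for a recorded excerpt are assumptions made for this example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # default sampling frequency used in the experiments


def write_tone(fileobj, freq_hz, seconds, amplitude=0.5):
    """Write a mono 16-bit PCM sine tone (a stand-in for a recorded clip)."""
    n = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(amplitude * 32767 *
                              math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)))
        for t in range(n)
    )
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)


def read_samples(fileobj):
    """Return (sample_rate, samples scaled to [-1, 1)) from a mono 16-bit WAV."""
    with wave.open(fileobj, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    count = len(raw) // 2
    ints = struct.unpack("<%dh" % count, raw)
    return rate, [s / 32768.0 for s in ints]
```

Both helpers accept a path or a binary file-like object; for instance, an `io.BytesIO()` buffer can be written with `write_tone`, rewound with `seek(0)`, and passed to `read_samples`.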

For analysis we leverage the following audio tools and analytic
modules: MIRtoolbox [14], Netlab [16], Audacity [3], and the
Android app Hertz [7]. Both MIRtoolbox and Netlab are MATLAB
modules providing a rich set of functions for analyzing and
extracting audio features. Audacity and Hertz are mainly used
for recording audio clips on computers and smartphones,
respectively.

For analyzing and matching fingerprints we use a desktop
machine with the following configuration: Intel i7-2600 3.4GHz
processor with 12GiB RAM. We found that the average time
required to match a new fingerprint was around 5-10 ms for the
k-NN classifier and around 0.5-1 ms for the GMM classifier.



Evaluation metrics: We use standard multi-class classification
metrics (precision, recall, and F1-score [75]) in our
evaluation. Assuming there are fingerprints from n classes
(i.e., n distinct phones), we first compute the true positives
(TP) for each class, i.e., the number of traces from the class
that are classified correctly. Similarly, we compute the false
positives (FP) and false negatives (FN), as the number of
wrongly accepted and wrongly rejected traces, respectively, for
each class i (1 ≤ i ≤ n). We then compute precision, recall,
and the F1-score for each class using the following equations:



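\[ \text{Precision: } Pr_i = \frac{TP_i}{TP_i + FP_i} \tag{1} \]

\[ \text{Recall: } Re_i = \frac{TP_i}{TP_i + FN_i} \tag{2} \]

\[ \text{F1-Score: } F1_i = \frac{2 \times Pr_i \times Re_i}{Pr_i + Re_i} \tag{3} \]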


The F1-score is the harmonic mean of precision and recall; it
provides a good measure of overall classification performance,
since precision and recall represent a tradeoff: a more
conservative classifier that rejects more instances will have
higher precision but lower recall, and vice versa. To obtain
the overall performance of the system we compute average values
using the following equations:



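\[ \text{Avg. Precision: } AvgPr = \frac{\sum_{i=1}^{n} Pr_i}{n} \tag{4} \]

\[ \text{Avg. Recall: } AvgRe = \frac{\sum_{i=1}^{n} Re_i}{n} \tag{5} \]

\[ \text{Avg. F1-Score: } AvgF1 = \frac{2 \times AvgPr \times AvgRe}{AvgPr + AvgRe} \tag{6} \]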
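The per-class and averaged metrics defined in Equations (1) to (6) can be computed as in the following Python sketch. It is an illustration only: the function names are hypothetical, and the convention of returning 0.0 when a denominator is zero is an assumption that the paper does not specify.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Return {class: (precision, recall, f1)} following Equations (1)-(3)."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly accepted
        fn = sum(1 for t, p in pairs if t == c and p != c)  # wrongly rejected
        pr = tp / (tp + fp) if tp + fp else 0.0
        re = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
        metrics[c] = (pr, re, f1)
    return metrics


def averaged_metrics(metrics):
    """Return (AvgPr, AvgRe, AvgF1) following Equations (4)-(6)."""
    n = len(metrics)
    avg_pr = sum(m[0] for m in metrics.values()) / n
    avg_re = sum(m[1] for m in metrics.values()) / n
    total = avg_pr + avg_re
    avg_f1 = 2 * avg_pr * avg_re / total if total else 0.0
    return avg_pr, avg_re, avg_f1
```

Note that, as in Equation (6), the average F1-score is computed from the averaged precision and recall rather than as the mean of the per-class F1-scores.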



Each audio excerpt is recorded/played 10 times, 50% of which
are used for training and the remaining 50% for testing. We
report the maximum values obtained by varying the number of
neighbors (k) from 1 to 5 for the k-NN classifier and by
considering 1 to 5 Gaussian distributions per class for the GMM
classifier. Since GMM parameters are produced by the randomized
EM algorithm, we perform 10 parameter-generation runs for each
instance and report the average classification performance.²

5.2 Process of Fingerprinting Speakers 

An attacker can leverage our algorithms to passively ob- 
serve audio emitted from device speakers (e.g., ring- 
tones), in public environments. To investigate this, we 
first look at fingerprinting speakers integrated inside 
smartphones. For fingerprinting speakers we record audio clips
played from smartphones onto a laptop and we then extract
acoustic features from the recorded audio excerpts as shown in
Figure 4. We look at both devices manufactured by different
vendors and devices manufactured by the same vendor.



² We also computed the 95% confidence interval, but we found it
to be less than 0.01.






[Figure 4 (flow diagram): Smartphone -> Laptop/PC -> Extract
Acoustic Features -> Generate Fingerprint -> Match Fingerprint]

Figure 4: Steps of fingerprinting speakers.



5.3 Process of Fingerprinting Microphones 

Attackers may also attempt to fingerprint devices by ob- 
serving imperfections in device microphones, for exam- 
ple by convincing the user to install an application on 
their phone, which can observe inputs from the device's 
microphone. To investigate the feasibility of this attack, we
next look at fingerprinting microphones embedded in
smartphones. To do this, we record audio clips played from a
laptop onto smartphones as shown in Figure 5. Again we look at
both devices manufactured by different vendors and devices
manufactured by the same vendor.



Figure 5: Steps of fingerprinting microphones.

5.4 Fingerprinting Devices From Different Vendors

In this section we look at fingerprinting smartphones
manufactured by five different vendors. We look at
fingerprinting the devices through both microphone and speaker.

5.4.1 Fingerprinting Speaker

We found that fingerprinting smartphones manufactured by
different vendors is relatively simple compared to
fingerprinting devices manufactured by the same vendor. The
main reason is that the speaker volume sensitivity of different
smartphones was quite different, making it easier to track
them. Figure 6(a) shows an audio sample played from five
different smartphones. As we see, the signal strengths of the
audio signals are quite different from each other. Hence,
simple acoustic features like RMS value and spectral entropy
are good enough to obtain good clusters of data points.
Figure 6(b) shows a plot of spectral entropy vs. RMS value for
50 samples of an audio excerpt (10 samples from each handset).
We see that acoustic features like spectral entropy and RMS
value generate good clusters for each type of smartphone.

We test our fingerprinting approach using three different types
of audio excerpts. Each audio sample is recorded 10 times,
giving us a total of 50 samples from the five handsets. 50% of
the samples are used for training and the remaining 50% are
used for testing, and we repeat this procedure for the three
different types of audio excerpt. Table 4 summarizes our
findings (values are reported as percentages). We simply use
signal RMS value and spectral entropy as input features for the
k-NN classifier, while for the GMM classifier we added MFCCs as
an additional feature because doing so increased the GMM
classifier's success rate. From Table 4 we see that we can
successfully (with a precision of 100%) identify which audio
clip came from which smartphone. Thus fingerprinting
smartphones manufactured by different vendors seems very much
feasible using only 2 to 3 acoustic features.



[Figure 6(a): waveform panels of the same audio sample played
from different handsets (recoverable panel titles: iPhone 5,
Google Nexus 4G, Samsung Galaxy Note II, Sony Ericsson W518);
x-axis: Time (sec), 0 to 4.
Figure 6(b): scatter plot of spectral entropy against RMS value
(x-axis: RMS value, 0.04 to 0.22); legend: iPhone 5, Motorola
Droid A855, Google Nexus 4G, Samsung Galaxy Note II, Sony
Ericsson W518.]

Figure 6: a) Audio sample taken from five different handsets,
b) Plotting audio samples taken from five different handsets
using two acoustic features: signal RMS value and spectral
entropy.
Table 4: Fingerprinting different smartphones using speaker
output

Audio Type   | k-NN, Features [1,5]*   | GMM, Features [1,5,13]*
             | AvgPr   AvgRe   AvgF1   | AvgPr   AvgRe   AvgF1
-------------+-------------------------+------------------------
Instrumental | 100     100     100     | 100     100     100
Human speech | 100     100     100     | 100     100     100
Song         | 100     100     100     | 100     100     100

* Feature numbers taken from Table 1



Table 5: Fingerprinting different smartphones using mic

Audio Type   | k-NN, Features [1,5]*   | GMM, Features [1,5,13]*
             | AvgPr   AvgRe   AvgF1   | AvgPr   AvgRe   AvgF1
-------------+-------------------------+------------------------
Instrumental | 96.7    96      96.3    | 96.7    96      96.3
Human speech | 93.3    92      92.6    | 96.7    96      96.3
Song         | 96.7    96      96.3    | 100     100     100

* Feature numbers taken from Table 1



5.4.2 Fingerprinting Microphone 

Similar to speakers, we find that microphone properties differ
quite substantially across vendors, simplifying fingerprinting.
In particular, the sensitivities of the microphones of the five
handsets were different. As a result, when the same audio clip
is recorded on the phones, their respective RMS values and
spectral entropies are distinguishably different, making it
possible to fingerprint smartphones through microphones. To
test our hypothesis we again test our fingerprinting approach
using three different types of audio excerpts. Each audio
sample is recorded 10 times, 50% of which are used for training
and the remaining 50% are used for testing. Table 5 summarizes
our findings (values are reported as percentages). We use the
same set of features as we did in Section 5.4.1 and we see
similar outcomes. These results suggest that smartphones can be
successfully fingerprinted through microphones.

5.5 Fingerprinting Devices of the Same 
Model 

In this section we look at fingerprinting smartphones of the same model,
manufactured by the same vendor. We found this to be a comparatively
harder problem, so we first explore all 15 acoustic features listed in
Table 1 to determine the dominant subset of features. Next, we carry
out our fingerprinting task using this dominant subset. We again
fingerprint devices through both microphone and speaker. Note that the
audio excerpts used for feature exploration in Section 5.5.1 and the
ones used for evaluating our fingerprinting approach in Sections 5.5.2
and 5.5.3 are not identical. We use different audio excerpts, though
belonging to the same categories as listed in Table 5, so as to not
bias our evaluations.



5.5.1 Feature Exploration 

At first glance, it seems that we should use all features at our
disposal to identify device types. However, including too many
features can worsen performance in practice, due to their varying
accuracies and potentially conflicting signatures. Hence, in this
section, we explore all 15 audio features described in Section 4.1 and
identify the dominant subset, i.e., which combination of features
should be used. For this purpose we adopt a well-known machine
learning strategy known as feature selection [46, 78]. Feature
selection is the process of reducing the dimensionality of data by
selecting only a subset of the relevant features for use in model
construction. The main assumption behind feature selection is that the
data may contain redundant features, i.e., features that provide no
additional benefit over the currently selected features. Feature
selection techniques are a subset of the more general field of feature
extraction; in practice, however, the two are quite different. Feature
extraction creates new features as functions of the original features,
whereas feature selection returns a subset of the original features.
Feature selection is preferable to feature extraction when the
original units and meaning of the features are important and the
modeling goal is to identify an influential subset. When the features
themselves have different dimensionality, and numerical
transformations are inappropriate, feature selection becomes the
primary means of dimension reduction.

Feature selection involves the maximization of an objective function
as it searches through the possible candidate subsets. Since
exhaustive evaluation of all possible subsets is often infeasible
($2^N$ subsets for a total of $N$ features), different heuristics are
employed. We use a greedy search strategy known as sequential forward
selection (SFS), where we start off with an empty set and sequentially
add the features that maximize our objective function. The pseudocode
of our feature selection algorithm is given in Algorithm 1.

Algorithm 1 Sequential Feature Selection

Input: Input feature set F
Output: Dominant feature subset D
F1_score ← []
for f ∈ F do
    F1_score[f] ← Classify(f)
end for
F' ← sort(F, F1_score)    # In descending order
max_score ← 0
D ← ∅
for f ∈ F' do
    D ← D ∪ {f}
    temp ← Classify(D)
    if temp > max_score then
        max_score ← temp
    else
        D ← D − {f}
    end if
end for
return D

The algorithm works as follows. First, we compute the F1-score that
can be achieved by each feature individually. Next, we sort the
feature set by achieved F1-score in descending order. Then, we
iteratively add features, starting from the most dominant one, and
compute the F1-score of the combined feature subset. If adding a
feature increases the best F1-score seen so far, we keep it and move
on to the next feature; otherwise we remove the feature under
inspection. Having traversed the entire set of features, we return the
subset of features that maximizes the F1-score of our device
classification task. Note that this is a greedy approach; therefore,
the generated subset might not always provide the optimal F1-score.
However, for our purpose, we found this approach to perform well, as
we demonstrate in later sections.

Table 6: Feature exploration using sequential forward selection technique

                            Avg. Feature-Extraction Time (msec)     Maximum F1-Score (%)
                                                                    Instrumental    Human Speech    Song
 #  Feature                 Instrumental  Human Speech  Song        k-NN    GMM     k-NN    GMM     k-NN    GMM
 1  RMS                     9.26          10.01         11.23       21.8    17      37.9    34.4    20.1    26.2
 2  ZCR                     9.48          10.61         12.57       17.3    15.2    34.4    31.6    13      7.2
 3  Low-Energy-Rate         29.28         32.62         39.27       9.4     39.6    18.3    13.7    21.8    19.2
 4  Spectral Centroid       79.40         79.61         88.51       39.4    37.3    33.8    30.8    39.9    40.3
 5  Spectral Entropy        57.54         46.58         61.88       39.6    39.6    30.4    38.7    33.9    26.1
 6  Spectral Irregularity   6519.89       2387.04       15348.45    36      32.2    23.8    25.4    14.1    14.8
 7  Spectral Spread         80.12         69.19         108.23      44.4    39.6    35.2    31.7    35.2    38.4
 8  Spectral Skewness       120.29        109.26        179.33      32      41.7    30.1    34.3    31.5    40.4
 9  Spectral Kurtosis       136.86        131.17        154.03      43      39.6    34.2    39.2    31.1    36.8
10  Spectral Rolloff        73.16         52.08         65.70       57.3    50.6    29      30.5    38.7    39.4
11  Spectral Brightness     63.91         45.51         59.94       23.5    19.9    33.5    33.5    18.5    17.9
12  Spectral Flatness       76.48         57.38         71.79       41.9    35.8    37.1    39.4    32.4    29.8
13  MFCCs                   268.86        229.81        413.16      92.4    97.2    98.8    98.8    90      91.4
14  Chromagram              56.07         76.87         69.68       57.1    49.6    95.2    96.7    80.1    79.7
15  Tonal Centroid          79.54         99.95         85.79       57.1    46.1    93.7    95.2    63.6    53.7

Sequential Feature Selection
    Selected features                                               [13,14] [13,14] [13]    [13,14] [13,7]  [13,14]
    F1-score (%)                                                    96.3    97.7    98.8    100     92.6    94.1
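
The procedure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `select_features`, the `classify` callback (standing in for `Classify`, assumed to return an F1-score for a feature subset), and the toy score table are all hypothetical and do not reproduce the data in Table 6.

```python
from typing import Callable, Iterable, List, Sequence


def select_features(features: Iterable[int],
                    classify: Callable[[Sequence[int]], float]) -> List[int]:
    """Rank features by individual score, then add them greedily (Algorithm 1)."""
    feats = list(features)
    # Score every feature on its own.
    solo = {f: classify([f]) for f in feats}
    # Sort by individual score, best first.
    ranked = sorted(feats, key=lambda f: solo[f], reverse=True)
    best, chosen = 0.0, []
    for f in ranked:
        score = classify(chosen + [f])
        if score > best:  # keep f only if it strictly improves the best score
            best = score
            chosen.append(f)
    return chosen


# Hypothetical stand-in for Classify: a made-up score table, not measured data.
TOY_SCORES = {(13,): 0.92, (14,): 0.57, (7,): 0.44,
              (13, 14): 0.96, (7, 13, 14): 0.95}


def toy_classify(subset):
    return TOY_SCORES.get(tuple(sorted(subset)), 0.0)


selected = select_features([7, 13, 14], toy_classify)
print(selected)
```

Because each feature is inspected exactly once, the loop calls the classifier about 2N times for N features, rather than once per subset.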

We test our feature selection algorithm on all three types of audio
excerpts listed in Table 3. We evaluate the F1-score using both k-NN
and GMM classifiers. Table 6 highlights the maximum F1-score obtained
by varying k from 1 to 5 (for the k-NN classifier) and by considering
1 to 5 Gaussian distributions per class (for the GMM classifier). To
obtain the fingerprinting data we record audio clips played from 15
Motorola Droid A855 handsets. Each type of audio is recorded 10 times
per handset, giving us a total of 150 samples from the 15 handsets;
50% of these (i.e., 5 samples per class) are used for training and the
remaining 50% are used for testing. All training samples are labeled
with their corresponding handset identifier. Both classifiers return
the class label for each audio clip in the test set, and from these
labels we compute the F1-score. Table 6 shows the maximum F1-score
achieved by each acoustic feature for the three different types of
audio excerpt. We also provide the time required to extract each
feature. The table highlights the subset of features selected by our
sequential feature selection algorithm and the corresponding F1-score.
We find that MFCCs are the dominant feature for all categories of
audio excerpt. Chromagram also yields a high F1-score.
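
To make the step from predicted class labels to the reported scores concrete, the sketch below computes class-averaged precision and recall and an F1-score from a list of true and predicted handset labels. It is a sketch under an assumption: the exact averaging used by the authors is defined outside this section, so the function `macro_scores` and the choice to take F1 as the harmonic mean of the averaged precision and recall are illustrative (the tabulated values are at least consistent with that choice, e.g. 96.7 and 96 give roughly 96.3). The labels are made up.

```python
from collections import Counter


def macro_scores(true_labels, predicted_labels):
    """Return (avg precision, avg recall, F1 of those averages), in percent."""
    classes = sorted(set(true_labels))
    true_count = Counter(true_labels)
    pred_count = Counter(predicted_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    # Per-class precision: correct predictions / all predictions of that class.
    precisions = [hits[c] / pred_count[c] if pred_count[c] else 0.0
                  for c in classes]
    # Per-class recall: correct predictions / all true samples of that class.
    recalls = [hits[c] / true_count[c] for c in classes]
    avg_pr = 100 * sum(precisions) / len(classes)
    avg_re = 100 * sum(recalls) / len(classes)
    total = avg_pr + avg_re
    avg_f1 = 2 * avg_pr * avg_re / total if total else 0.0
    return avg_pr, avg_re, avg_f1


# Made-up example: three handsets, two test clips each, one misclassified clip.
truth = ["A", "A", "B", "B", "C", "C"]
guess = ["A", "A", "B", "C", "C", "C"]
avg_pr, avg_re, avg_f1 = macro_scores(truth, guess)
print(round(avg_pr, 1), round(avg_re, 1), round(avg_f1, 1))
```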

To give a better understanding of why MFCCs are the dominant acoustic
feature, we plot the MFCCs of a given audio excerpt from three
different handsets in Figure 7. The coefficients are ranked in the
same order for all three handsets. We can see that the magnitudes of
the coefficients vary across the three handsets; for example,
coefficients 3 and 5 vary significantly. This makes MFCCs a suitable
choice for fingerprinting smartphones.

5.5.2 Fingerprinting Speakers 

We now look at fingerprinting smartphones manufactured by the same
vendor through their speakers. For this set of experiments we use 15
Motorola Droid A855 handsets. Table 7 highlights our findings. We
again test our fingerprinting approach against three different forms
of audio excerpt. We use the acoustic features obtained from our
sequential feature selection algorithm, as listed in Table 6. From
Table 7 we see that we can achieve an F1-score of over 94% in
identifying which audio clip originated from which handset. Thus
fingerprinting smartphones through their speakers appears to be a
viable option.

Figure 7: MFCCs of the same audio sample taken from three different handsets manufactured by the same vendor. We can see that
some of the coefficients vary significantly, thus enabling us to exploit this feature to fingerprint smartphones.



Table 7: Fingerprinting similar smartphones using speaker output

                 k-NN                                   GMM
Audio Type       Features*  AvgPr   AvgRe   AvgF1       Features*  AvgPr   AvgRe   AvgF1
Instrumental     [13,14]    96.7    96      96.3        [13,14]    98.3    98      98.1
Human speech     [13]       98.9    98.7    98.8        [13,14]    98.9    98.7    98.8
Song             [13,7]     93.7    92      92.8        [13,14]    95.6    93.3    94.4

* Feature numbers taken from Table 6



5.5.3 Fingerprinting Microphone 

We now investigate fingerprinting smartphones made by the same vendor
through microphone-sourced input. We use 15 Motorola Droid A855
handsets for these experiments. We use the features obtained through
Algorithm 1, which are listed in Table 6. Table 8 summarizes our
findings. We see results similar to those for fingerprinting speakers:
we were able to achieve an F1-score of 93% in identifying the handset
from which the audio excerpt originated. Thus fingerprinting
smartphones through microphones also appears to be feasible.



Table 8: Fingerprinting similar smartphones using microphone

                 k-NN                                   GMM
Audio Type       Features*  AvgPr   AvgRe   AvgF1       Features*  AvgPr   AvgRe   AvgF1
Instrumental     [13,14]    93.7    92      92.8        [13,14]    94.1    92      93
Human speech     [13]       98.9    98.7    98.8        [13,14]    98.9    98.7    98.8
Song             [13,7]     93.9    93.3    93.6        [13,14]    96.1    95.2    95.6

* Feature numbers taken from Table 6



5.6 Sensitivity Analysis 

In this section we investigate how different factors, such as audio
sampling rate, training set size, the distance from audio source to
recorder, and background noise, impact our fingerprinting performance.
Such investigations help us determine the conditions under which our
fingerprinting approach is feasible. For the following set of
experiments we focus only on fingerprinting smartphones from the same
vendor, and we consider only speaker fingerprinting, as we saw almost
identical outcomes for fingerprinting microphones. We also restrict
the recordings to ringtones (i.e., audio clips belonging to our
Instrumental category). Since we are recording ringtones, we use only
the features highlighted in Table 6 under the 'Instrumental' category.

5.6.1 Impact of Sampling Rate 

First, we investigate how the sampling rate of audio signals impacts
our fingerprinting precision. To do this, we record a ringtone at
three sampling frequencies: 8 kHz, 22.05 kHz and 44.1 kHz. Each sample
is recorded 10 times, with half of the recordings used for training
and the other half for testing. Figure 8 shows the average precision
and recall obtained under the different sampling rates. As the figure
shows, precision and recall drop as the sampling frequency decreases.
This is understandable, because a higher sampling frequency captures
finer-grained information about the audio sample. However, the default
sampling frequency on most handheld devices today is 44.1 kHz [4],
with some of the latest models adopting even higher sampling rates
[1]. We therefore believe that sampling rate will not pose an obstacle
to our fingerprinting approach, and that in the future we will be able
to capture more fine-grained variations with the use of higher
sampling rates.
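
One way to see why a lower sampling rate carries less information is the Nyquist limit: a signal sampled at rate f_s can only represent frequency content up to f_s/2. The short sketch below lists that limit for the three sampling rates used here. Linking the accuracy drop to this limit is an interpretation offered for illustration, not a claim made by the measurements themselves.

```python
# Sampling rates used in the experiment, in Hz.
SAMPLING_RATES_HZ = [8_000, 22_050, 44_100]

# Highest representable frequency (Nyquist limit) for each sampling rate.
nyquist_hz = {fs: fs / 2 for fs in SAMPLING_RATES_HZ}

for fs, limit in nyquist_hz.items():
    print(f"{fs / 1000:g} kHz sampling -> content up to {limit / 1000:g} kHz")
```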




[Figure 8 plot: precision and recall versus Sampling Frequency (kHz) at
44.1, 22.05 and 8.]

Figure 8: Impact of sampling frequency on precision/recall.

5.6.2 Varying Training Size

Next, we consider the performance of the classifiers in the presence
of limited training data. For this experiment we vary the training set
size from 10% to 50% of all available samples (i.e., from 1 to 5
samples per class). Table 9 shows the evolution of the F1-score as the
training set size is increased (values are reported as percentages).
As expected, the F1-score rises as the training set size increases.
However, with only three samples per class we already achieve an
F1-score of over 90%. This suggests that we do not need many training
samples to construct a good predictive model.

Table 9: Impact of varying training size

Training samples   k-NN, Features [13,14]*     GMM, Features [13,14]*
per class          AvgPr   AvgRe   AvgF1       AvgPr   AvgRe   AvgF1
1                  42      49.3    45.3        50      53.3    51.6
2                  79.2    80      79.6        80.4    80      80.2
3                  91.3    89.3    90.2        91.7    89.3    90.5
4                  95.3    94.7    95          95.6    94.7    95.1
5                  96.7    96      96.3        98.3    98      98.1

* Feature numbers taken from Table 6



5.6.3 Varying Distance between Audio Source and 
Recorder 

Next, we inspect the impact of the distance between the audio source
(i.e., smartphone) and the recorder (i.e., laptop/PC) on
fingerprinting precision/recall. For this experiment we use a separate
external microphone, as the signal capturing capacity of microphones
embedded inside laptops degrades drastically as distance increases. We
use the relatively inexpensive ($44.79) Audio-Technica ATR-6550
shotgun microphone and vary the distance between the external
microphone and the smartphone from 0.1 meter to 5 meters. Figure 9
shows the experimental setup and Table 10 summarizes the F1-scores
obtained as the distance between the smartphone and microphone varies.
We see that as distance increases, the F1-score decreases. This is
expected, because the greater the distance between the smartphone and
the microphone, the harder it becomes to capture the minuscule
deviations between audio samples. However, even at a distance of two
meters we can achieve an F1-score of 93%. This suggests that, with
commodity microphones, our device fingerprinting approach works only
up to a certain distance. Specialized microphones, such as parabolic
microphones (usually used to capture animal sounds from far away),
could help increase fingerprinting precision at even longer distances.




Figure 9: Experimental setup for varying the distance between 
the smartphone and microphone. 

Table 10: Impact of varying distance

Distance     k-NN, Features [13,14]*     GMM, Features [13,14]*
(meters)     AvgPr   AvgRe   AvgF1       AvgPr   AvgRe   AvgF1
0.1          96.7    96      96.3        98.3    98      98.1
1            92.7    91.5    92          95.2    94.7    94.9
2            88.2    87.6    87.9        94.5    92      93.2
3            76.7    76      76.3        78.9    84      81.4
4            70.2    64      67          76.8    76      76.4
5            64.5    62.7    63.6        77      73.3    75.1

* Feature numbers taken from Table 6

5.6.4 Impact of Ambient Background Noise 

In this section we investigate how ambient background noise impacts
the performance of our fingerprinting technique. For this experiment
we consider scenarios where there is a crowd of people using their
smart devices and we are trying to fingerprint those devices by
capturing audio signals (in this case ringtones) from the surrounding
environment. Table 11 lists the four scenarios that we consider. To
capture audio signals under such scenarios, two external speakers,
placed between the smartphone and the microphone, constantly replayed
the respective ambient noise in the background while ringtones played
from different handsets were being recorded. We consider a distance of
two meters from the audio source to the recorder. The ambient
background sounds were obtained from PacDV [2] and SoundJay [17].
Table 11 shows our findings (values are reported as percentages). We
can see that even in the presence of various types of background noise
we can achieve an F1-score of over 91%.

Table 11: Impact of ambient background noise

                   k-NN, Features [13,14]*     GMM, Features [13,14]*
Environments       AvgPr   AvgRe   AvgF1       AvgPr   AvgRe   AvgF1
Shopping Mall      88.8    85.3    87          95.1    93.3    94.2
Restaurant/Cafe    90.5    89.7    90.1        92.5    90.7    91.6
City Park          91.7    90      90.8        95.2    94.1    94.6
Airport Gate       91.3    89.5    90.4        94.5    93.3    93.9

* Feature numbers taken from Table 6

6 Applications



Fingerprinting smart devices can be thought of as a double-edged sword
when it comes to device security. On one hand, it can jeopardize
privacy, as it allows remote identification without user awareness. On
the other hand, it could potentially be used to enhance authentication
of physical devices. We discuss these potential applications below.

6.1 Multi-factor Authentication

Conventional computing systems authenticate users by verifying some
static factors such as user-generated passwords (which may be coupled
with additional security questions like a PIN code or phone number). A
password consists of a string of characters, remembered by the human
user, which can be provided as a proof of identity. However, passwords
are vulnerable to guessing algorithms. Moreover, if passwords ever
leak, they potentially give an unauthorized user access to the system.
Often systems do not incorporate mechanisms to verify whether the
authenticated user is using an authorized device. Modern highly-secure
organizations (e.g., military and department of defense) are therefore
moving towards using various forms of active authentication for their
employees [3].

Device fingerprinting can be used to provide a multi-factor
authentication framework that enables a system administrator to
validate whether authenticated users are using their allocated devices
to log in to the system. This scenario is of course applicable to
highly security-conscious organizations where tracking authenticated
users does not constitute a privacy violation. This can be done by
leveraging our techniques, for example by instructing the user's
device to record an audio sample broadcast over the PA system, or to
transmit an audio session over the phone. Alternatively, the device
may be able to "fingerprint itself" by playing a small received audio
clip through its speaker, simultaneously recording it via its
microphone, and then transmitting the result over the network to the
authentication server for verification.¹ In this way we can tie user
and device identity together to form a multi-factor authentication
framework. As a side note, this only provides additional assurance,
rather than a foolproof authentication method. However, we believe our
approach is more robust than existing software-based two-factor
authentication systems (e.g., in systems where a token must be
submitted along with a password, an attacker who gets hold of the
secret key can generate the desired token), as it is harder to mimic
hardware-level imperfections.

¹ We are assuming that the user is not using headphones at the start
of authentication.

6.2 Device Tracking

By the same token, an attacker can violate user privacy via a similar
approach, either by installing a malicious application on the user's
device or by recording broadcast audio in public environments. For
example, a malicious application (e.g., a game) can play small audio
segments, record them using the device's microphone, and upload the
recorded clips to the attacker. To do this, the application would
require both microphone and network access permissions, but this might
not be a big assumption to make: most users are unaware of the
security risks associated with mobile apps, and a significant portion
of users cannot fully comprehend the full extent of the permissions
they grant [38, 53].

Alternatively, the attacker may sit in public environments (cafe,
shopping mall) and record broadcast audio (speakerphone conversations,
ringtones) with the intent to track and identify users.

7 Related Work 

Fingerprints have long been used as one of the most common biometrics
for identifying human beings [29, 72]. The same concept was extended
to identifying and tracking unique mobile transmitters by the US
government during the 1960s [55]. Later, with the emergence of
cellular networks, it became possible to uniquely identify
transmitters by analyzing the externally observable characteristics of
the emitted radio signals [71].

Physical devices usually differ at either the software or the hardware
level, even if they are produced by the same vendor. In terms of
software-based fingerprinting, researchers have looked at techniques
that differentiate between unique devices over a Wireless Local Area
Network (WLAN) simply through a timing analysis of 802.11 probe
request frames [31]. Others have looked at exploiting differences in
the firmware and device drivers running on IEEE 802.11 compliant
devices [39]. 802.11 MAC headers have also been used to track unique
devices [44]. Pang et al. [67] were able to exploit traffic patterns
to carry out device fingerprinting. Open source toolkits like Nmap
[59] and Xprobe [79] can remotely fingerprint an operating system by
identifying unique responses from the TCP/IP networking stack.

Another angle on software-based fingerprinting is to exploit
applications like browsers to carry out device fingerprinting [35].
Yen et al. [80] were successful at tracking users with high precision
by analyzing month-long logs of Bing and Hotmail. Researchers have
also been able to exploit JavaScript and popular third-party plugins
like Flash player to obtain the list of fonts installed on a device,
which then enabled them to uniquely track users [19]. Other
researchers have proposed the use of performance benchmarks for
differentiating between JavaScript engines [64]. Furthermore, browsing
history can be exploited to fingerprint and track web users [66]. The
downside of software-based fingerprints is that they are generated
from the current configuration of the system, which is not static and
is likely to change over time.

Hardware-based fingerprinting approaches rely on some static source of
idiosyncrasies. It has been shown that network devices tend to have
constant clock skews [63], and researchers have been able to exploit
these clock skews to distinguish devices through TCP and ICMP
timestamps [54]. However, the clock skew rate is highly dependent on
the experimental environment. Researchers have also extensively looked
at fingerprinting the unique transient characteristics of radio
transmitters (also known as Radio Frequency (RF) fingerprinting). RF
fingerprinting has been shown to be a means of enhancing wireless
authentication [22, 57, 65, 77]. It has also been used for location
detection [68]. Manufacturing imperfections in network interface cards
(NICs) have also been studied by analyzing the analog signals
transmitted from them [23, 41]. More recently, Dey et al. have studied
manufacturing idiosyncrasies inside smartphone accelerometers to
distinguish devices [32]. However, their approach requires some form
of external stimulation/vibration to successfully capture the
manufacturing imperfection of the on-board accelerometer. Moreover,
there are different contexts in which audio prints can be more useful,
e.g., software that is not allowed to access the accelerometer, or an
external adversary who fingerprints nearby phones with a microphone.

Our work is inspired by the aforementioned hardware-based
fingerprinting works, but instead of focusing on wireless transmitters
or on-board sensors that require external stimulation, we focus on
fingerprinting on-board acoustic components like microphones and
speakers.

Audio fingerprinting has a rich history of notable research works
[25]. There are studies that have looked at classifying audio excerpts
based on their content [45, 76]. Others have looked at distinguishing
human speakers from audio segments [21, 24]. There has also been work
on exploring various acoustic features for audio classification [61].
One of the more popular applications of audio fingerprinting has been
music genre and artist recognition.



Our work takes advantage of the large set of acoustic features that
have been explored by the aforementioned works. However, instead of
classifying the content of audio segments, we use the acoustic
features to capture the manufacturing imperfections of microphones and
speakers embedded in smart devices.



8 Discussion and Limitations 

Our approach has several limitations. First, we experimented with 15
devices manufactured by the same vendor; it is possible that a larger
target device pool would result in lower accuracy. That said,
distinctions across different device types are clearer; additionally,
audio fingerprints may be used in tandem with other techniques, such
as accelerometer fingerprinting [32], to better discriminate between
devices. Second, we evaluated our fingerprinting precision/recall
under only two types of classifiers (GMM and k-NN). Other forms of
classification, such as ensemble-based approaches, could possibly
achieve better results, as ensemble-based methods use multiple models
to obtain better predictive performance than any single classifier
[33]. However, as a first step we were able to achieve over 93%
precision using simple k-NN and GMM classifiers, and our results may
point to the concern that relatively simple techniques have a high
success rate. Lastly, the phones used in our experiments were not in
mint condition, and some of the idiosyncrasies of individual
microphones and speakers may have been the result of uneven wear and
tear on each device; we believe, however, that this is likely to occur
in the real world as well.

9 Conclusion 

In this paper we show that it is feasible to fingerprint smart devices
through on-board acoustic components like microphones and speakers. As
microphones and speakers are among the most standard components
present in almost all smart devices available today, this creates a
key privacy concern for users. By the same token, efficient
fingerprinting may also serve to enhance authentication. To
demonstrate the feasibility of this approach, we collect fingerprints
from five different brands of smartphones, as well as from 15
smartphones manufactured by the same vendor. Our studies show that it
is possible to successfully fingerprint smartphones through
microphones and speakers, not only under controlled environments, but
also in the presence of ambient noise. We believe our findings are
important steps towards understanding the full consequences of
fingerprinting smart devices through acoustic channels.

A Audio Features

Root-Mean-Square (RMS) Energy: This feature computes the square root
of the arithmetic mean of the squares of the original audio signal
strength at various frequencies. In the case of a set of $N$ values
$\{x_1, x_2, \ldots, x_N\}$, the RMS value is given by the following
formula:

$$x_{rms} = \sqrt{\frac{1}{N}\left(x_1^2 + x_2^2 + \cdots + x_N^2\right)} \qquad (7)$$

The RMS value provides an approximation of the average audio signal
strength.
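
As a minimal sketch, equation (7) translates directly into Python. The function name `rms` and the sample values are illustrative only.

```python
import math


def rms(samples):
    """Root-mean-square of a sequence of values, as in equation (7)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


# Squares are 9, 16, 9, 16; their mean is 12.5.
value = rms([3.0, -4.0, 3.0, -4.0])
print(value)
```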

Zero Crossing Rate (ZCR): The zero-crossing rate is the rate at which
the signal changes sign from positive to negative or back [27]. This
feature has been used heavily in both speech recognition and music
information retrieval, for example to classify percussive sounds [43].
ZCR for a signal $s$ of length $T$ can be defined as:

$$ZCR = \frac{1}{T}\sum_{t=1}^{T}\left|s(t) - s(t-1)\right| \qquad (8)$$

where $s(t) = 1$ if the signal has a positive amplitude at time $t$
and 0 otherwise. Zero-crossing rates provide a measure of the
noisiness of the signal.
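
The definition above can be sketched as follows. One detail is an assumption made here for illustration: since the first sample has no predecessor, the sum runs over consecutive pairs of samples and is normalised by the number of pairs.

```python
def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose sign indicator differs."""
    # Indicator s(t): 1 for positive amplitude, 0 otherwise.
    s = [1 if x > 0 else 0 for x in signal]
    if len(s) < 2:
        return 0.0
    changes = sum(abs(s[t] - s[t - 1]) for t in range(1, len(s)))
    return changes / (len(s) - 1)


alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0, 1.0])  # flips every step
smooth = zero_crossing_rate([1.0, 2.0, 3.0, -1.0, -2.0])       # one sign change
print(alternating, smooth)
```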

Low Energy Rate: The low energy rate computes the percentage of frames
(typically 50 ms chunks) with RMS power less than the average RMS
power for the whole audio signal. For instance, a musical excerpt with
some very loud frames and a lot of silent frames would have a high
low-energy rate.
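
A minimal sketch of this feature is shown below. Two choices are assumptions made for illustration: the signal is cut into non-overlapping frames of `frame_len` samples (a trailing partial frame is dropped), and the "average RMS power" is taken as the mean of the per-frame RMS values. The result is returned as a fraction between 0 and 1.

```python
import math


def low_energy_rate(samples, frame_len):
    """Fraction of frames whose RMS is below the mean frame RMS."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    frame_rms = [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]
    mean_rms = sum(frame_rms) / len(frame_rms)
    return sum(1 for r in frame_rms if r < mean_rms) / len(frame_rms)


# One loud frame followed by three silent frames: most frames are "low energy".
rate = low_energy_rate([1.0] * 4 + [0.0] * 12, frame_len=4)
print(rate)
```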

Spectral Centroid: The spectral centroid represents the "center of
mass" of a spectral power distribution. It is calculated as the
weighted mean of the frequencies present in the signal, determined
using a Fourier transform, with their magnitudes as the weights:

$$\text{Centroid},\ \mu = \frac{\sum_{i=1}^{N} f_i \, m_i}{\sum_{i=1}^{N} m_i} \qquad (9)$$

where $m_i$ represents the magnitude of bin number $i$, and $f_i$
represents the center frequency of that bin.
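
The sketch below illustrates the centroid computation on a synthetic tone. To stay dependency-free it uses a naive O(N^2) discrete Fourier transform; in practice a library FFT would be used instead. The helper names, the 8 kHz sampling rate, and the 1 kHz test tone are assumptions chosen for the example.

```python
import cmath
import math


def magnitude_spectrum(samples, sample_rate):
    """Naive DFT; returns (bin centre frequencies, magnitudes) up to Nyquist."""
    n = len(samples)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency, as in equation (9)."""
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)


# A pure 1 kHz sine sampled at 8 kHz: all energy falls into the 1 kHz bin.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
freqs, mags = magnitude_spectrum(tone, 8000)
centroid = spectral_centroid(freqs, mags)
print(round(centroid, 3))
```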

Spectral Entropy: Spectral entropy captures the spikiness of a
spectral distribution. As a result, spectral entropy can be used to
capture the formants or peaks in the sound envelope [62]. To compute
spectral entropy, a Discrete Fourier Transform (DFT) of the signal is
first carried out. Next, the frequency spectrum is converted into a
probability mass function (PMF) by normalizing the spectrum using the
following equation:

$$w_i = \frac{m_i}{\sum_{j=1}^{N} m_j} \qquad (10)$$

where $m_i$ represents the energy/magnitude of the $i$-th frequency
component of the spectrum, $w = (w_1, w_2, \ldots, w_N)$ is the PMF of
the spectrum and $N$ is the number of points in the spectrum. This PMF
can then be used to compute the spectral entropy using the following
equation:

$$H = -\sum_{i=1}^{N} w_i \cdot \log_2 w_i \qquad (11)$$

The central idea of using entropy as a feature is to capture the peaks
of the spectrum and their location.
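
A small sketch makes the "spikiness" interpretation concrete: a flat spectrum gives the maximum entropy, while a single spike gives zero. The function takes a list of spectral magnitudes as input; the sign convention (the usual Shannon entropy with a leading minus) and the treatment of zero-weight bins as contributing nothing are assumptions of this sketch.

```python
import math


def spectral_entropy(mags):
    """Shannon entropy (in bits) of the normalised magnitude spectrum."""
    total = sum(mags)
    pmf = [m / total for m in mags]  # normalise the spectrum into a PMF
    # Bins with zero weight are skipped (0 * log 0 is taken as 0).
    return -sum(w * math.log2(w) for w in pmf if w > 0)


flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])    # evenly spread spectrum
spiky = spectral_entropy([1.0, 0.0, 0.0, 0.0])   # a single spectral peak
print(flat, spiky)
```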

Spectral Irregularity: Spectral irregularity measures the degree of
variation of the successive peaks of a spectrum. This feature provides
the ability to capture the jitter or noise in spectra. Spectral
irregularity is computed as the sum of the square of the difference in
amplitude between adjoining spectral peaks [50] using the following
equation:

$$\text{Irregularity} = \sum_{i=1}^{N} \left(a_i - a_{i+1}\right)^2 \qquad (12)$$

where $a_i$ is the amplitude of the $i$-th spectral peak and the
$(N+1)$-th peak is assumed to be zero. A change in irregularity
changes the perceived timbre of a sound.
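
Following the verbal definition above (sum of squared differences between adjoining peak amplitudes, with a zero appended after the last peak), a minimal sketch looks like this. The input is assumed to be a list of peak amplitudes that has already been extracted from the spectrum; peak picking itself is not shown.

```python
def spectral_irregularity(peaks):
    """Sum of squared differences between adjoining spectral peak amplitudes."""
    padded = list(peaks) + [0.0]  # the peak after the last one is taken as zero
    return sum((padded[i] - padded[i + 1]) ** 2 for i in range(len(peaks)))


even = spectral_irregularity([1.0, 1.0, 1.0])    # only the final drop to zero counts
jagged = spectral_irregularity([1.0, 3.0, 1.0])  # 4 + 4 + 1
print(even, jagged)
```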

Spectral Spread: Spectral spread defines the dispersion 
of the spectrum around its centroid, i.e., it measures the 
standard deviation of the spectral distribution. It can
therefore be computed as:



$$\text{Spread},\ \sigma = \sqrt{\sum_{i=1}^{N} \left[ (f_i - \mu)^2 \cdot w_i \right]} \tag{13}$$

where $w_i$ represents the weight of the $i$-th frequency
component obtained from equation (10) and $\mu$ represents
the centroid of the spectrum obtained from equation (9).
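
The spread can be sketched as a weighted standard deviation.
The function below recomputes the weights and the centroid
internally so that it is self-contained; the argument names
are illustrative.

```python
import math


def spectral_spread(freqs, mags):
    """Standard deviation of the spectrum around its centroid."""
    total = sum(mags)
    if total == 0:
        raise ValueError("spectrum has no energy")
    weights = [m / total for m in mags]               # normalized weights
    mu = sum(f * w for f, w in zip(freqs, weights))   # spectral centroid
    variance = sum((f - mu) ** 2 * w for f, w in zip(freqs, weights))
    return math.sqrt(variance)


# Two equal-magnitude bins 200 Hz apart sit 100 Hz from their centroid.
print(spectral_spread([100.0, 300.0], [1.0, 1.0]))  # 100.0
```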

Spectral Skewness: Spectral skewness computes the
coefficient of skewness of a spectrum. Skewness (third
central moment) measures the asymmetry of the distribution.
A positively skewed distribution has a longer tail to the
right, while a negatively skewed distribution has a longer
tail to the left. A symmetrical distribution has a skewness
of zero. The coefficient of skewness is the ratio of the
skewness to the standard deviation raised to the third power:



$$\text{Skewness} = \frac{\sum_{i=1}^{N} \left[ (f_i - \mu)^3 \cdot w_i \right]}{\sigma^3} \tag{14}$$



Spectral Kurtosis: Spectral kurtosis gives a measure
of the flatness or spikiness of a distribution relative to a 
normal distribution. It is computed from the fourth cen- 
tral moment using the following function: 



$$\text{Kurtosis} = \frac{\sum_{i=1}^{N} \left[ (f_i - \mu)^4 \cdot w_i \right]}{\sigma^4} \tag{15}$$



A kurtosis value of 3 means the distribution is similar
to a normal distribution, whereas values less than 3 indicate
flatter distributions and values greater than 3 indicate
steeper distributions.
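
Skewness and kurtosis are both standardized central moments,
so one helper can return the pair. This is a sketch under the
assumption that the spectrum is treated as a weighted
distribution over bin frequencies; the function name and the
error handling are illustrative choices.

```python
import math


def spectral_skewness_kurtosis(freqs, mags):
    """Return (skewness, kurtosis) of the magnitude-weighted spectrum."""
    total = sum(mags)
    if total == 0:
        raise ValueError("spectrum has no energy")
    weights = [m / total for m in mags]
    mu = sum(f * w for f, w in zip(freqs, weights))

    def central_moment(order):
        return sum((f - mu) ** order * w for f, w in zip(freqs, weights))

    sigma = math.sqrt(central_moment(2))
    if sigma == 0:
        raise ValueError("all energy lies in a single bin")
    return central_moment(3) / sigma ** 3, central_moment(4) / sigma ** 4


# More weight on the low bin leaves a longer tail to the right (positive skew).
skew, kurt = spectral_skewness_kurtosis([0.0, 4.0], [3.0, 1.0])
print(skew > 0)  # True
```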

Spectral Rolloff: The spectral rolloff is defined as the
frequency below which 85% of the distribution magnitude
is concentrated [76]:



$$\operatorname*{arg\,min}_{f_c \in \{1, \ldots, N\}} \; \sum_{i=1}^{f_c} m_i \geq 0.85 \cdot \sum_{i=1}^{N} m_i \tag{16}$$

where $f_c$ is the rolloff frequency and $m_i$ is the magnitude
of the $i$-th frequency component of the spectrum. The
rolloff is another measure of spectral shape that is
correlated to the noise cutting frequency [69].
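
A minimal sketch of the rolloff search: accumulate magnitudes
from the lowest bin upward and stop at the first bin where the
running sum reaches 85% of the total. The argument names are
illustrative, and `freqs` is assumed to be non-empty and
sorted in ascending order.

```python
def spectral_rolloff(freqs, mags, fraction=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches the threshold."""
    threshold = fraction * sum(mags)
    cumulative = 0.0
    for f, m in zip(freqs, mags):
        cumulative += m
        if cumulative >= threshold:
            return f
    return freqs[-1]  # fallback for rounding edge cases


# Ten equal bins at 100..1000 Hz: the ninth bin is the first to reach 85%.
freqs = [100.0 * k for k in range(1, 11)]
print(spectral_rolloff(freqs, [1.0] * 10))  # 900.0
```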

Spectral Brightness: Spectral brightness calculates the
amount of spectral energy corresponding to frequencies
higher than a given cut-off threshold. This metric
correlates to the perceived timbre of a sound. An increase
in higher-frequency energy in the spectrum yields a sharper
timbre, whereas a decrease yields a softer timbre [52].
Spectral brightness can be computed using the following
equation:

$$\text{Brightness}_{f_c} = \sum_{i=f_c}^{N} m_i \tag{17}$$

where $f_c$ is the cut-off frequency (set to 1500 Hz) and $m_i$
is the magnitude of the $i$-th frequency component of the
spectrum.
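
The brightness sum can be sketched in a few lines. The
sketch assumes the cut-off is given in Hz and compared
against each bin's center frequency, with the cut-off bin
itself included; the argument names are illustrative.

```python
def spectral_brightness(freqs, mags, cutoff_hz=1500.0):
    """Total magnitude of all bins at or above the cut-off frequency."""
    return sum(m for f, m in zip(freqs, mags) if f >= cutoff_hz)


# Hypothetical 4-bin spectrum: only the 1500 Hz and 2000 Hz bins count.
print(spectral_brightness([500.0, 1000.0, 1500.0, 2000.0], [1.0, 2.0, 3.0, 4.0]))  # 7.0
```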

Spectral Flatness: Spectral flatness measures how en- 
ergy is spread across the spectrum, giving a high value 
when energy is equally distributed and a low value when 
energy is concentrated in a small number of narrow fre- 
quency bands. The spectral flatness is calculated by di- 
viding the geometric mean of the power spectrum by the 
arithmetic mean of the power spectrum [51]:



$$\text{Flatness} = \frac{\left( \prod_{i=1}^{N} m_i \right)^{1/N}}{\frac{1}{N} \sum_{i=1}^{N} m_i} \tag{18}$$

where $m_i$ represents the magnitude of bin number $i$.
Spectral flatness provides a way to quantify the noise- 
like or tone-like nature of the signal. One advantage of 
using spectral flatness is that it is not affected by the
amplitude of the signal, meaning spectral flatness remains
virtually unchanged when the distance between the sound
source and microphone fluctuates during recording.
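
The ratio of geometric to arithmetic mean, and its
insensitivity to overall amplitude, can be checked with a
short sketch. Computing the geometric mean through logarithms
is an implementation choice made here to avoid overflow in
long products; inputs are assumed to be non-negative.

```python
import math


def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of a non-negative spectrum."""
    n = len(power)
    if n == 0 or any(p == 0 for p in power):
        # An empty spectrum or any empty bin gives a geometric mean of zero.
        return 0.0
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return geometric / arithmetic


quiet = [1.0, 4.0]
loud = [10.0, 40.0]  # the same spectrum scaled by 10
print(abs(spectral_flatness(quiet) - spectral_flatness(loud)) < 1e-12)  # True
```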

Mel-Frequency Cepstrum Coefficients (MFCCs): 

MFCCs are short-term spectral features and are widely
used in the area of audio and speech processing [58, 76].
Their success has been due to their capability of
compactly representing spectrum amplitudes. Figure 10
highlights the procedure for extracting MFCCs from audio
signals. The first step is to divide the signal into
fixed-size frames (typically 50 ms chunks) by applying a
windowing function at fixed intervals. The next step is to
take the Discrete Fourier Transform (DFT) of each frame.
After taking the log-amplitude of the magnitude spectrum,
the DFT bins are grouped and smoothed according to the
perceptually motivated Mel-frequency scaling. Finally, in
order to decorrelate the resulting feature vectors, a
discrete cosine transform is performed. We use the first
13 coefficients for our experiments.



The Mel scale approximates the human auditory response more closely
than linearly-spaced frequency bands; see
http://en.wikipedia.org/wiki/Mel_scale
Figure 10: Procedure for extracting MFCCs from audio signals
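
The MFCC procedure for a single frame can be sketched in
plain Python. Several details here are assumptions rather
than the paper's settings: the Hamming window, the 26
triangular filters, the common 2595 * log10(1 + f/700) Mel
formula, and the widely used ordering in which the Mel
filterbank is applied before the logarithm. The naive DFT is
quadratic in the frame length, so this is for illustration
only.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes (bins 0..n/2) of a Hamming-windowed frame, len >= 2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * j / (num_filters + 1)) for j in range(num_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate, num_filters=26, num_coeffs=13):
    """First num_coeffs MFCCs of one frame."""
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log of each Mel-band energy; the small floor avoids log(0).
    log_energies = [math.log(max(sum(w * m for w, m in zip(row, spectrum)), 1e-10))
                    for row in bank]
    # DCT-II decorrelates the log energies.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(num_coeffs)]


# Hypothetical input: 256 samples of a 440 Hz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
print(len(mfcc_frame(tone, 8000)))  # 13
```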



Chromagram: A chromagram (also known as har- 
monic pitch class profile) is a 12-dimensional vector rep- 
resentation of an audio signal showing the distribution of 
energy along the 12 distinct semitones or pitch classes. 
First a DFT of the audio signal is taken and then the 
spectral frequencies are mapped onto a limited set of 12 
chroma values in a many-to-one fashion [40]. In general,
chromagrams are robust to noise (e.g., ambient noise or 
percussive sounds) and independent of timbre change. 
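
The many-to-one mapping from spectral frequencies to pitch
classes can be sketched as follows. The sketch assumes equal
temperament with A4 = 440 Hz and MIDI-style numbering, so
index 0 is C and index 9 is A; real chromagram
implementations typically add tuning estimation and
smoothing, which are omitted here.

```python
import math


def chroma_vector(freqs, mags, a4_hz=440.0):
    """Fold spectral magnitudes into 12 pitch classes (index 0 = C)."""
    chroma = [0.0] * 12
    for f, m in zip(freqs, mags):
        if f <= 0:
            continue  # the DC bin has no pitch
        midi_note = 69.0 + 12.0 * math.log2(f / a4_hz)
        chroma[int(round(midi_note)) % 12] += m
    return chroma


# A4 (440 Hz) and A5 (880 Hz) fold into the same pitch class.
chroma = chroma_vector([440.0, 880.0, 261.63], [1.0, 2.0, 4.0])
print(chroma[9], chroma[0])  # 3.0 4.0
```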

Tonal Centroid: The tonal centroid, introduced by Harte
et al. [48], maps a chromagram onto a six-dimensional
hypertorus structure. The resulting representation wraps
around the surface of a hypertorus and can be visualized
as a set of three circles of harmonic pitch intervals:
fifths, major thirds, and minor thirds. Tonal centroids are
efficient in detecting changes in harmonic content.
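
A sketch of the chroma-to-tonal-centroid projection is given
below. The radii (1, 1, 0.5) and angular steps (7π/6, 3π/2,
2π/3 per pitch class) are the values commonly quoted for this
transform and are stated here as assumptions, not as settings
confirmed by this paper. The result is normalized by the L1
norm of the chroma vector.

```python
import math

# (radius, angle step per pitch class) for the circles of
# fifths, minor thirds, and major thirds; assumed values.
CIRCLES = [(1.0, 7.0 * math.pi / 6.0),
           (1.0, 3.0 * math.pi / 2.0),
           (0.5, 2.0 * math.pi / 3.0)]


def tonal_centroid(chroma):
    """Project a 12-bin chroma vector onto a 6-dimensional tonal centroid."""
    total = sum(abs(c) for c in chroma)
    if total == 0:
        return [0.0] * 6  # silent frame (assumed convention)
    centroid = []
    for radius, step in CIRCLES:
        centroid.append(sum(c * radius * math.sin(l * step)
                            for l, c in enumerate(chroma)) / total)
        centroid.append(sum(c * radius * math.cos(l * step)
                            for l, c in enumerate(chroma)) / total)
    return centroid


# A uniform chroma vector has no harmonic direction, so it maps near the origin.
print(all(abs(x) < 1e-9 for x in tonal_centroid([1.0] * 12)))  # True
```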